fix(efcore): support DTFx distributed tracing without breaking execution #108
Merged
Enabling DurableTask.Core distributed tracing (CorrelationSettings.Current.EnableDistributedTracing = true) hung every orchestration on the EFCore backend instead of merely fragmenting spans.

Root cause: the backend never populated TaskOrchestrationWorkItem.TraceContext. When tracing is on, TaskOrchestrationDispatcher seeds CorrelationTraceContext.Current from workItem.TraceContext (null) and then dereferences it unconditionally (ExecutionStartedEvent.Correlation = CorrelationTraceContext.Current.SerializableTraceContext), throwing a NullReferenceException. The work item is aborted and retried forever, so the orchestration never completes. This legacy App-Insights correlation path runs even under the W3CTraceContext protocol.

Fix: in LockNextTaskOrchestrationWorkItemAsync, restore the work item's trace context from the ExecutionStartedEvent's correlation payload (TraceContextBase.Restore, which returns a valid empty context when there is none), mirroring the reference backends. Guarded on EnableDistributedTracing so the tracing-off default stays zero-overhead.

The W3C span carriers (ExecutionStartedEvent.ParentTraceContext / TaskScheduledEvent.ParentTraceContext) already round-trip through the Newtonsoft serializer, so no schema, column, or serialization changes are needed; the entire bug was the null TraceContext.

Adds an acceptance test (InMemory + Postgres + SqlServer + MySql variants) asserting execution completes and that client -> orchestration -> activity spans all share the caller's root trace id.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Owner: Thanks for the improvement!
Owner: LGTM.
Problem
Enabling the Durable Task Framework's built-in distributed tracing (CorrelationSettings.Current.EnableDistributedTracing = true, Protocol.W3CTraceContext) against the EFCore backend doesn't just fragment spans; it breaks orchestration execution entirely. A trivial saga (client starts an orchestration that schedules one activity) hangs and never completes; the worker polls the database forever. With tracing off it completes in well under a second.
Root cause
The EFCore backend never populates TaskOrchestrationWorkItem.TraceContext. When tracing is enabled, DurableTask.Core.TaskOrchestrationDispatcher seeds an AsyncLocal (CorrelationTraceContext.Current) from the work item's TraceContext, which is null here, and later dereferences it unconditionally (ExecutionStartedEvent.Correlation = CorrelationTraceContext.Current.SerializableTraceContext).
The resulting NullReferenceException aborts the work item, which is then re-fetched and fails the same way forever: the observed infinite polling / hang. Note that this is the framework's legacy App-Insights correlation path, which executes even when the protocol is W3CTraceContext, so it trips as soon as tracing is turned on.
The client-side create_orchestration span still works because it captures Activity.Current before storage; the worker-side orchestration and activity spans never appear because the work item is never processed successfully.
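The failure mode described above can be illustrated with a minimal, self-contained sketch. The types below (TraceContextStandIn, WorkItemStandIn, DispatcherSketch) are hypothetical stand-ins invented for this illustration; they are not the DurableTask.Core classes and only model the seed-then-dereference sequence and the resulting retry loop.

```csharp
using System;
using System.Threading;

// Simplified stand-ins for illustration only; NOT the real DurableTask.Core types.
public sealed class TraceContextStandIn
{
    public string SerializableTraceContext = "{}";
}

public sealed class WorkItemStandIn
{
    // Stays null when the backend never populates it.
    public TraceContextStandIn TraceContext;
}

public static class DispatcherSketch
{
    private static readonly AsyncLocal<TraceContextStandIn> Current =
        new AsyncLocal<TraceContextStandIn>();

    // Models the described sequence: seed the AsyncLocal, then dereference it.
    public static string ProcessOnce(WorkItemStandIn workItem)
    {
        Current.Value = workItem.TraceContext;
        // Throws NullReferenceException when the seeded value is null.
        return Current.Value.SerializableTraceContext;
    }

    // Models abort-and-retry: returns how many attempts failed before one
    // succeeded, or maxAttempts if none did.
    public static int CountFailedAttempts(WorkItemStandIn workItem, int maxAttempts)
    {
        int failures = 0;
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            try
            {
                ProcessOnce(workItem);
                return failures;
            }
            catch (NullReferenceException)
            {
                failures++; // the work item is aborted and fetched again
            }
        }
        return failures;
    }
}
```

With a null TraceContext every attempt fails, however many retries are allowed, which matches the observed hang; once the work item carries any non-null context the first attempt succeeds.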
Fix
In LockNextTaskOrchestrationWorkItemAsync, restore the work item's trace context from the ExecutionStartedEvent's correlation payload, mirroring the reference backends (Azure Storage, MSSQL). The restore is guarded on EnableDistributedTracing.
TraceContextBase.Restore(null) returns a valid empty context, and on later turns it restores the correlation the dispatcher persisted onto ExecutionStartedEvent.Correlation, giving proper cross-turn continuity. The guard keeps the tracing-off default allocation-free and behaviourally identical.
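The shape of the fix can be sketched as follows. This is not the diff from this PR: TraceContextModel, ExecutionStartedModel, OrchestrationWorkItemModel, and the RestoreTraceContext helper are hypothetical stand-ins that only mirror the behaviour described above (a restore that never returns null, guarded by the tracing flag).

```csharp
using System;

// Simplified stand-ins for illustration only; NOT the real DurableTask.Core types.
public sealed class TraceContextModel
{
    public string Payload { get; private set; }

    private TraceContextModel(string payload)
    {
        Payload = payload;
    }

    // Stand-in for the described Restore behaviour: a null or empty payload
    // yields a valid empty context instead of null.
    public static TraceContextModel Restore(string json)
    {
        return new TraceContextModel(string.IsNullOrEmpty(json) ? "{}" : json);
    }
}

public sealed class ExecutionStartedModel
{
    public string Correlation { get; set; }
}

public sealed class OrchestrationWorkItemModel
{
    public TraceContextModel TraceContext { get; set; }
}

public static class LockWorkItemSketch
{
    // Hypothetical helper: populate the work item's trace context only when
    // tracing is enabled, so the tracing-off path does no extra work.
    public static void RestoreTraceContext(
        OrchestrationWorkItemModel workItem,
        ExecutionStartedModel started,
        bool tracingEnabled)
    {
        if (!tracingEnabled)
        {
            return;
        }

        // On the first turn Correlation is still null; on later turns it holds
        // whatever was persisted onto the started event.
        workItem.TraceContext = TraceContextModel.Restore(started?.Correlation);
    }
}
```

The point of the guard is that the tracing-off path leaves the work item exactly as before, while the tracing-on path always hands the dispatcher a non-null context.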
Notably no schema/serialization changes
The W3C span carriers, ExecutionStartedEvent.ParentTraceContext and TaskScheduledEvent.ParentTraceContext, already round-trip through the Newtonsoft TypelessJsonDataConverter (they're plain public properties). Once the null TraceContext NRE is removed, the OpenTelemetry spans connect on their own. No new columns, no model changes, no serializer changes. Activity work items need nothing extra (the activity dispatcher already reads workItem.TraceContextBase?. null-safely).
Tests
Adds a distributed-tracing acceptance test following the existing per-storage convention (base + InMemory/Postgres/SqlServer/MySql variants). It runs the saga under a root System.Diagnostics.Activity with an ActivityListener on "DurableTask.Core" and asserts that execution completes and that the create_orchestration → orchestration → activity spans all share the caller's root trace id.
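The listener-and-assert pattern the test relies on can be sketched with only System.Diagnostics. This is not the test added in this PR: TraceIdAssertionSketch and its Capture helper are hypothetical, and the spans here are emitted by a local ActivitySource that merely reuses the "DurableTask.Core" name, linked through Activity.Current rather than through persisted trace context.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class TraceIdAssertionSketch
{
    // Runs `work` under a root activity and returns the root trace id together
    // with every activity from `sourceName` that stopped while it ran.
    public static (ActivityTraceId Root, List<Activity> Stopped) Capture(
        string sourceName, Action<ActivitySource> work)
    {
        var stopped = new List<Activity>();

        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == sourceName,
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllDataAndRecorded,
            ActivityStopped = activity =>
            {
                lock (stopped) { stopped.Add(activity); }
            }
        };
        ActivitySource.AddActivityListener(listener);

        using var source = new ActivitySource(sourceName);
        using var root = new Activity("caller-root");
        root.SetIdFormat(ActivityIdFormat.W3C);
        root.Start();

        work(source);

        var rootTraceId = root.TraceId;
        return (rootTraceId, stopped);
    }
}
```

A caller would start three nested spans inside `work` (for example `create_orchestration`, `orchestration`, `activity`) and then check that each captured activity's TraceId equals the returned root trace id.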
The test class runs in a DisableParallelization collection because CorrelationSettings.Current is a process-wide static.
Verification
confirmed on InMemory and real Postgres (all three spans on one trace
id, execution completes).
pass with tracing off; full solution builds with 0 warnings/0 errors on
net8.0/net9.0/net10.0.
Fixed against Microsoft.Azure.DurableTask.Core 3.7.0 (works within its tracing model; no newer core required).
🤖 Generated with Claude Code